| Country | Region | Happiness.Rank | Happiness.Score | Standard.Error | Economy..GDP.per.Capita. | Family | Health..Life.Expectancy. | Freedom | Trust..Government.Corruption. | Generosity | Dystopia.Residual |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Switzerland | Western Europe | 1 | 7.587 | 0.03411 | 1.39651 | 1.34951 | 0.94143 | 0.66557 | 0.41978 | 0.29678 | 2.51738 |
| Iceland | Western Europe | 2 | 7.561 | 0.04884 | 1.30232 | 1.40223 | 0.94784 | 0.62877 | 0.14145 | 0.43630 | 2.70201 |
| Denmark | Western Europe | 3 | 7.527 | 0.03328 | 1.32548 | 1.36058 | 0.87464 | 0.64938 | 0.48357 | 0.34139 | 2.49204 |
In addition, we wanted to include further factors and added the following three datasets:
By merging the datasets, we now have four additional factors: Alcohol, Population, Tobacco and Internet.
To join the different datasets, we had to do some manual preprocessing, which can be seen in the preprocessing step. The main steps were cleaning the data (region, country code, NaN) and joining the datasets on the year and the country code.
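The join on year and country code can be sketched in base R as follows. This is only a minimal illustration: the data frames, the column names `Country.Code`, `Year` and `Alcohol`, and all values are made up and are not the project's actual preprocessing code.

```r
# Toy stand-ins for the happiness data and one additional dataset.
# All names and values here are invented for illustration.
happiness <- data.frame(
  Country.Code = c("AA", "BB", "CC", "AA"),
  Year         = c(2015, 2015, 2015, 2016),
  Happiness    = c(7.5, 7.4, 7.3, 7.6),
  stringsAsFactors = FALSE
)
alcohol <- data.frame(
  Country.Code = c("AA", "CC", "AA"),
  Year         = c(2015, 2015, 2016),
  Alcohol      = c(11.5, 10.4, 11.4),
  stringsAsFactors = FALSE
)

# Left join on both keys: keep every happiness row and
# fill rows without a match in the second dataset with NA.
joined <- merge(happiness, alcohol,
                by = c("Country.Code", "Year"),
                all.x = TRUE)
print(joined)
```

Using `all.x = TRUE` keeps countries that are missing from an additional dataset, so the gaps stay visible as `NA` instead of silently dropping rows.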
After joining, we noticed that the three additional datasets do not contain data for the whole timespan 2015-2022 (fig. “missing values full data”). Therefore, we decided to create two datasets: one for analysing the change in happiness over time and one for analysing the factors that influence happiness in a single year.
For the first dataset, the over-time analysis, we only included the 6 factors from the base happiness dataset and excluded all rows containing missing values. We also renamed the columns to get shorter labels.
| Country | Happiness.Rank | Happiness | Economy | Family | Health | Freedom | Trust | Generosity | Year | Region |
|---|---|---|---|---|---|---|---|---|---|---|
| Switzerland | 1 | 7.587 | 1.39651 | 1.34951 | 0.94143 | 0.66557 | 0.41978 | 0.29678 | 2015 | Western Europe |
| Iceland | 2 | 7.561 | 1.30232 | 1.40223 | 0.94784 | 0.62877 | 0.14145 | 0.43630 | 2015 | Western Europe |
| Denmark | 3 | 7.527 | 1.32548 | 1.36058 | 0.87464 | 0.64938 | 0.48357 | 0.34139 | 2015 | Western Europe |
For the second dataset, the influential factors analysis, we inspected the missing values of each year and chose the year with the fewest missing values, 2018 (fig. “missing values 2018”). Then we again excluded all rows containing missing values. Figure “missing values 2017” shows, for example, that the smoking and the alcohol datasets did not contain any values for the year 2017. We also renamed the columns to get shorter labels.
| Country | Happiness.Rank | Happiness | Economy | Family | Health | Freedom | Trust | Generosity | Year | Region | Country.Code | Code | Alcohol | Population | Tobacco | Internet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Finland | 1 | 7.632 | 1.305 | 1.592 | 0.874 | 0.681 | 0.393 | 0.202 | 2018 | Western Europe | FI | FIN | 10.78 | 5522585 | 19.7 | 88.88996 |
| Norway | 2 | 7.594 | 1.456 | 1.582 | 0.861 | 0.686 | 0.340 | 0.286 | 2018 | Western Europe | NO | NOR | 7.41 | 5337960 | 13.0 | 96.49166 |
| Denmark | 3 | 7.555 | 1.351 | 1.590 | 0.868 | 0.683 | 0.408 | 0.284 | 2018 | Western Europe | DK | DNK | 10.26 | 5752131 | 18.6 | 97.31920 |
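The year selection described above (count the missing values per year, pick the year with the fewest, drop incomplete rows) could look roughly like this in base R. The data frame `full_data` and its values are invented for illustration, and the real analysis may have counted missing values differently.

```r
# Toy long-format data; values are invented for illustration.
full_data <- data.frame(
  Country = rep(c("A", "B", "C"), times = 3),
  Year    = rep(c(2016, 2017, 2018), each = 3),
  Alcohol = c(1.0, NA, 2.0,  NA, NA, NA,  1.5, 2.5, 3.0),
  Tobacco = c(10, 12, NA,    NA, NA, NA,  11, NA, 14)
)

# Count NA cells per row, then sum them up per year.
na_per_row  <- rowSums(is.na(full_data))
na_per_year <- tapply(na_per_row, full_data$Year, sum)
print(na_per_year)

# Pick the year with the fewest missing cells and drop incomplete rows.
best_year <- as.numeric(names(which.min(na_per_year)))
data_best <- na.omit(full_data[full_data$Year == best_year, ])
print(data_best)
```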
Figures: “missing values full data”, “missing values 2017”, “missing values 2018”.
One of the objectives of preliminary data analysis is to get a feel for the data you are dealing with by describing its key features and summarizing the results. We focus on the second dataset, the influential factors analysis dataset, as it contains the most explanatory variables.
First, we check via the summary how the explanatory variables are distributed. As we can see, they are on different scales, especially “Population” and “Internet”. Because we do not want the following analysis to be dominated by the variables with the largest distances, we standardize each variable by \(\frac{x - \mathrm{mean}(x)}{\mathrm{sd}(x)}\).
## Happiness Economy Family Health
## Min. :2.905 Min. :0.0760 Min. :0.372 Min. :0.0000
## 1st Qu.:4.486 1st Qu.:0.7040 1st Qu.:1.063 1st Qu.:0.4475
## Median :5.483 Median :1.0100 Median :1.314 Median :0.6750
## Mean :5.489 Mean :0.9335 Mean :1.247 Mean :0.6283
## 3rd Qu.:6.332 3rd Qu.:1.2240 3rd Qu.:1.481 3rd Qu.:0.8180
## Max. :7.632 Max. :1.5760 Max. :1.644 Max. :1.0080
##
## Freedom Trust Generosity Alcohol
## Min. :0.0250 Min. :0.0000 Min. :0.0000 Min. : 0.003
## 1st Qu.:0.3875 1st Qu.:0.0500 1st Qu.:0.1020 1st Qu.: 3.220
## Median :0.5040 Median :0.0880 Median :0.1670 Median : 7.150
## Mean :0.4758 Mean :0.1195 Mean :0.1840 Mean : 6.842
## 3rd Qu.:0.5835 3rd Qu.:0.1450 3rd Qu.:0.2545 3rd Qu.:10.385
## Max. :0.7240 Max. :0.4570 Max. :0.5980 Max. :15.090
##
## Population Tobacco Internet
## Min. :3.367e+05 Min. : 3.70 Min. : 4.10
## 1st Qu.:5.488e+06 1st Qu.:13.90 1st Qu.:37.60
## Median :1.444e+07 Median :22.20 Median :68.21
## Mean :6.007e+07 Mean :22.02 Mean :60.43
## 3rd Qu.:4.430e+07 3rd Qu.:27.90 3rd Qu.:82.81
## Max. :1.428e+09 Max. :45.50 Max. :99.60
##
## Region
## Sub-Saharan Africa :27
## Western Europe :20
## Latin America and Caribbean :14
## Central and Eastern Europe :13
## Middle East and North Africa :12
## Commonwealth of Independent States: 9
## (Other) :20
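As a minimal sketch of the standardization step, the following compares the manual z-score with base R's `scale()`. The toy data frame and its values are assumptions for illustration and are not the report's data.

```r
# Toy columns on very different scales (values invented for illustration).
toy <- data.frame(
  Population = c(3.4e5, 5.5e6, 1.4e7, 4.4e7, 1.4e9),
  Internet   = c(4.1, 37.6, 68.2, 82.8, 99.6)
)

# Manual z-score: (x - mean(x)) / sd(x), applied column by column.
manual <- as.data.frame(lapply(toy, function(x) (x - mean(x)) / sd(x)))

# scale() centers and scales each column the same way by default.
builtin <- as.data.frame(scale(toy))

# After scaling, every column has mean 0 and standard deviation 1
# (up to floating-point error).
print(colMeans(manual))
print(sapply(manual, sd))
```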
We can see that every factor is now on the same scale. We have some outliers for Family, Freedom, Trust, Generosity and Population.
In the correlation matrix plot, we see that happiness has the strongest correlation with Economy (0.833) and Internet (0.817). For the correlations between the explanatory variables the following stand out:
One tool for getting a first look at what influences happiness is linear regression. For the regression we use the unscaled data. If our linear model predicts well, we can interpret the coefficients to see how they influence the outcome. This is also called regression analysis, where the goal is to isolate the relationship between each explanatory variable and the outcome variable.
However, this interpretation assumes that you can change the value of one explanatory variable while the others stay fixed. This of course is only true if there are no correlations between the explanatory variables. If this independence does not hold, we have a problem of multicollinearity. This can result in the coefficients swinging wildly depending on which other independent variables are in the model. The coefficients then become very sensitive to small changes in the model and cannot be easily interpreted.
One way to assess how strongly the explanatory variables are affected by multicollinearity is the variance inflation factor (VIF). VIFs identify correlations and their strength. VIFs between 1 and 5 suggest that there is a small correlation; VIFs greater than 5 represent critical levels of multicollinearity where the coefficients are poorly estimated.
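For intuition, the VIF of predictor \(j\) is \(1 / (1 - R_j^2)\), where \(R_j^2\) is the R-squared obtained by regressing predictor \(j\) on all other predictors. The sketch below computes it by hand in base R on simulated data. The variables `x1`, `x2`, `x3` are hypothetical and not part of the happiness data. Packages such as `car` offer a ready-made `vif()` function for fitted models.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)   # strongly correlated with x1
x3 <- rnorm(n)                  # generated independently of x1 and x2
predictors <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing
# predictor j on all remaining predictors.
vif_manual <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  fit <- lm(reformulate(others, response = v), data = predictors)
  1 / (1 - summary(fit)$r.squared)
})
print(round(vif_manual, 2))
```

In this simulation the two correlated predictors get VIFs above the threshold of 5, while the independent one stays close to 1.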
If we build a linear regression model on all explanatory variables, we get an R-squared of 0.8303. However, by plotting the VIF values we can see that a model based on all explanatory variables has severe multicollinearity. Therefore, we cannot interpret the coefficients for Internet and Economy.
##
## Call:
## lm(formula = Happiness ~ ., data = not_scaled_data_factors)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.52814 -0.29381 0.02782 0.30897 1.30352
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.139e+00 2.513e-01 8.510 1.39e-13 ***
## Economy 8.474e-01 4.030e-01 2.103 0.037923 *
## Family 1.004e+00 2.732e-01 3.677 0.000376 ***
## Health 8.986e-01 4.190e-01 2.144 0.034323 *
## Freedom 7.349e-01 4.203e-01 1.749 0.083302 .
## Trust 5.337e-01 5.615e-01 0.951 0.343986
## Generosity 1.129e+00 5.072e-01 2.225 0.028256 *
## Alcohol 3.129e-03 1.352e-02 0.231 0.817491
## Population 6.294e-11 2.692e-10 0.234 0.815587
## Tobacco -2.120e-02 5.560e-03 -3.812 0.000234 ***
## Internet 9.302e-03 5.284e-03 1.760 0.081276 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4937 on 104 degrees of freedom
## Multiple R-squared: 0.8303, Adjusted R-squared: 0.814
## F-statistic: 50.89 on 10 and 104 DF, p-value: < 2.2e-16
If we build a linear regression model without Internet and Economy, we get an R-squared of 0.7924. This R-squared is lower than before, but after plotting the VIF values we can see that we can now interpret the coefficients for the remaining explanatory variables.
Interestingly, only Family, Health and Tobacco are statistically significant:
##
## Call:
## lm(formula = Happiness ~ . - Internet - Economy, data = not_scaled_data_factors)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.5823 -0.3292 0.0333 0.3509 1.3621
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.836e+00 2.666e-01 6.886 4.21e-10 ***
## Family 1.598e+00 2.671e-01 5.984 3.00e-08 ***
## Health 2.404e+00 3.057e-01 7.864 3.31e-12 ***
## Freedom 7.007e-01 4.587e-01 1.528 0.12961
## Trust 9.984e-01 6.059e-01 1.648 0.10237
## Generosity 6.073e-01 5.377e-01 1.129 0.26124
## Alcohol 1.741e-04 1.479e-02 0.012 0.99063
## Population -2.081e-11 2.842e-10 -0.073 0.94177
## Tobacco -1.884e-02 6.036e-03 -3.122 0.00231 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5409 on 106 degrees of freedom
## Multiple R-squared: 0.7924, Adjusted R-squared: 0.7767
## F-statistic: 50.58 on 8 and 106 DF, p-value: < 2.2e-16
Next, we tried out linear regression methods with shrinkage. For Lasso and Ridge regression, all predictor variables should be scaled so that they have the same standard deviation. Otherwise, the predictors are penalized unequally depending on their scale. The glmnet() function, however, standardises the predictors by default, and the output coefficients are converted back to the original scale.
The Ridge regression
x <- model.matrix(Happiness ~ . -Internet - Economy , data = not_scaled_data_factors)[, -1]
y <- not_scaled_data_factors$Happiness
"Ridge Regression"
## [1] "Ridge Regression"
ridge.out <- cv.glmnet(x,y, alpha = 0)
coef(ridge.out)
## 9 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 2.418451e+00
## Family 1.158007e+00
## Health 1.530776e+00
## Freedom 9.983247e-01
## Trust 1.207948e+00
## Generosity 3.723445e-01
## Alcohol 1.564960e-02
## Population -1.074559e-10
## Tobacco -5.645452e-03
The Lasso Regression
"Lasso Regression"
## [1] "Lasso Regression"
lasso.out <- cv.glmnet(x,y, alpha = 1)
coef(lasso.out)
## 9 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 2.065732886
## Family 1.476035508
## Health 1.997754757
## Freedom 0.733453871
## Trust 0.934505831
## Generosity .
## Alcohol .
## Population .
## Tobacco -0.006076065
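One detail worth keeping in mind when reading these coefficients: `coef()` on a `cv.glmnet` object uses `s = "lambda.1se"` by default, which is the more strongly regularized choice, and the cross-validation folds are assigned at random. The following sketch on simulated data (the matrix `x`, response `y` and the seed are assumptions, not the report's data) shows how to fix the seed and compare both choices of lambda.

```r
library(glmnet)

set.seed(42)  # cv.glmnet assigns folds at random, so fix the seed
n <- 100
x <- matrix(rnorm(n * 5), nrow = n,
            dimnames = list(NULL, paste0("x", 1:5)))
y <- 2 * x[, 1] - 1 * x[, 2] + rnorm(n)   # only x1 and x2 carry signal

lasso_cv <- cv.glmnet(x, y, alpha = 1)

# lambda.1se: largest lambda within one standard error of the minimum CV error.
# lambda.min: lambda with the minimum CV error.
coef_1se <- coef(lasso_cv, s = "lambda.1se")
coef_min <- coef(lasso_cv, s = "lambda.min")
print(cbind(lambda.1se = as.matrix(coef_1se)[, 1],
            lambda.min = as.matrix(coef_min)[, 1]))
```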
Geographic map (color each country based on its percentage change over time, 2015-2022)
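The per-country percentage change that such a map would display could be computed along these lines. This is a sketch under assumptions: the data frame `scores` and its values are invented, and the change is measured relative to the 2015 score.

```r
# Toy happiness scores for the first and last year (values invented).
scores <- data.frame(
  Country   = c("A", "A", "B", "B"),
  Year      = c(2015, 2022, 2015, 2022),
  Happiness = c(5.0, 5.5, 6.0, 5.4)
)

first <- scores[scores$Year == 2015, c("Country", "Happiness")]
last  <- scores[scores$Year == 2022, c("Country", "Happiness")]
change <- merge(first, last, by = "Country", suffixes = c(".2015", ".2022"))

# Percentage change relative to the 2015 score.
change$Pct.Change <- 100 * (change$Happiness.2022 - change$Happiness.2015) /
  change$Happiness.2015
print(change)
```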
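library(ggplot2)
library(plotly)

# Boxplot of happiness per region, with each country overlaid as a jittered point
box <- ggplot(data_2018, aes(x = Region, y = Happiness, color = Region)) +
  geom_boxplot() +
  geom_jitter(aes(color = Country), size = 0.5) +
  ggtitle("Happiness Score for Regions and Countries") +
  coord_flip() +
  theme(legend.position = "none")
# Convert the static ggplot into an interactive plotly widget
ggplotly(box)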